data = read_delim('bank.csv', delim = ';')
##
## -- Column specification --------------------------------------------------------
## cols(
## age = col_double(),
## job = col_character(),
## marital = col_character(),
## education = col_character(),
## default = col_character(),
## balance = col_double(),
## housing = col_character(),
## loan = col_character(),
## contact = col_character(),
## day = col_double(),
## month = col_character(),
## duration = col_double(),
## campaign = col_double(),
## pdays = col_double(),
## previous = col_double(),
## poutcome = col_character(),
## y = col_character()
## )
data %>%
head() %>%
kable()
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
| 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
df_status(data)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 age 0 0.00 0 0 0 0 numeric 67
## 2 job 0 0.00 0 0 0 0 character 12
## 3 marital 0 0.00 0 0 0 0 character 3
## 4 education 0 0.00 0 0 0 0 character 4
## 5 default 0 0.00 0 0 0 0 character 2
## 6 balance 357 7.90 0 0 0 0 numeric 2353
## 7 housing 0 0.00 0 0 0 0 character 2
## 8 loan 0 0.00 0 0 0 0 character 2
## 9 contact 0 0.00 0 0 0 0 character 3
## 10 day 0 0.00 0 0 0 0 numeric 31
## 11 month 0 0.00 0 0 0 0 character 12
## 12 duration 0 0.00 0 0 0 0 numeric 875
## 13 campaign 0 0.00 0 0 0 0 numeric 32
## 14 pdays 0 0.00 0 0 0 0 numeric 292
## 15 previous 3705 81.95 0 0 0 0 numeric 24
## 16 poutcome 0 0.00 0 0 0 0 character 4
## 17 y 0 0.00 0 0 0 0 character 2
There are no missing values.
Check the response variable:
table(data$y)/nrow(data)
##
## no yes
## 0.88476 0.11524
Only around 11.5% of clients subscribed to a term deposit, so the classes are imbalanced. A stratified sampling method is needed when splitting into train and test sets.
plot1 = inspect_num(data) %>%
show_plot()+
theme_minimal()
ggplotly(plot1)
Most numeric variables are skewed. Based on the attribute information, the columns duration, campaign, pdays, and previous describe customer contacts, so it is probably better to leave these columns as is.
Try to log-transform the balance column since it is probably log-normally distributed. Some observations will be removed because the log scale drops negative and zero values.
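Before transforming, it is worth counting how many observations a log scale would silently drop (a quick check, assuming `data` is the tibble read above):

```r
# Observations with non-positive balance are dropped by scale_x_log10()
sum(data$balance <= 0)
```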
plot2 = data %>%
ggplot(aes(balance))+
geom_histogram(fill = 'steelblue')+
scale_x_log10(label = comma)+
labs(y = '', x = 'Log balance', title = 'Log distribution of balance')+
theme_minimal()
ggplotly(plot2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Check if there are still outliers after the log transformation:
outliers_balance = boxplot(log(data$balance), plot = FALSE)$out
data %>%
filter(log(balance) %in% outliers_balance, balance > 100) %>%
kable()
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | retired | married | primary | no | 71188 | no | no | cellular | 6 | oct | 205 | 1 | -1 | 0 | unknown | no |
quantile(data$balance)
## 0% 25% 50% 75% 100%
## -3313 69 444 1480 71188
It would be better to remove this entry since such an extreme value might distort the model.
corr = data %>%
select_if(is.numeric) %>%
cor()
corrplot(corr, method = 'number')
The fields previous and pdays are highly correlated because both relate to previous campaign contacts. The two fields are also logically dependent: a client who was never contacted before has pdays = -1 and previous = 0.
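This dependence can be verified directly (a quick check on the same `data`):

```r
# pdays == -1 marks clients never contacted in a previous campaign;
# for those rows previous should always be 0
data %>% count(pdays == -1, previous == 0)

# Linear correlation between the two fields
cor(data$pdays, data$previous)
```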
Check frequency of categorical data
inspect_cat(data) %>%
show_plot()
Plot the first three demographic variables related to the bank client:
plot3 = data %>%
ggplot(aes(x = job, fill = education))+
geom_bar(position = 'fill')+
coord_flip()+
facet_grid(~marital)+
labs(y = 'proportions', x= '', title = 'Job by marital status and education')+
scale_y_continuous(breaks = seq(0, 1, .2),
label = percent)+
scale_fill_brewer(palette = 'GnBu')+
theme_minimal()+
theme(legend.position = 'bottom')
ggplotly(plot3)
Some insights from this graph:
There are no divorced students in our dataset. All divorced clients with an unknown job have tertiary education. Entrepreneurs and retirees have diverse education levels and marital statuses.
We can also see the distribution of education and marital status per job. Most management jobs have a tertiary level of education, whereas services and admin are dominantly secondary education regardless of marital status.
Plot the other three bank client variables: housing loan, personal loan, and default.
plot4 = data %>%
ggplot(aes(default, fill = loan))+
geom_bar(position = 'fill', alpha = .7)+
scale_y_continuous(breaks = seq(0, 1, .2),
label = percent) +
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
labs(x= 'Credit card defaulted', fill = 'Personal Loan',
title = 'Percentage of individuals with personal loan by credit card default')+
theme_minimal()+
theme(legend.position = 'top')
plot4
plot5 = data %>%
ggplot(aes(default, fill = housing))+
geom_bar(position = 'fill', alpha = .7)+
scale_y_continuous(breaks = seq(0, 1, .2),
label = percent) +
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
labs(x= 'Credit card has default', fill = 'Housing loan',
title = 'Percentage of individuals with housing loan by credit card default')+
theme_minimal()
plot5
Unlike the first graph, the proportion of individuals with a housing loan is about the same whether or not they have a credit default. This is probably because housing loans are commonly taken out regardless of credit behavior. The personal loan variable would therefore be a better predictor of default than the housing loan field.
plot6 = data %>%
group_by(poutcome, y) %>%
count() %>%
group_by(y) %>%
mutate(percentage = n/ sum(n)) %>%
ggplot(aes(reorder(poutcome, percentage), percentage, fill = y))+
geom_col(position = 'dodge', alpha = .7)+
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
labs(x = 'Previous campaign outcome', y = 'Percentage', fill = 'W/ term deposit',
title = 'Outcome of previous campaign and term deposit')+
theme_minimal()+
coord_flip()
plot6
plot8 = data %>%
ggplot(aes(balance, fill = y, color = y))+
geom_density(alpha = .3)+
scale_x_log10(labels = comma)+
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
scale_color_manual(values = c('yes' = 'tomato', 'no' = 'gray'), guide = F)+
labs(title = 'Balance distribution by obtained term deposit', x = 'Log balance', fill = 'W/ term deposit')+
theme_minimal()
plot8
The density peak for clients with a term deposit sits slightly to the right, implying that those who subscribe tend to have a higher balance.
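The shift can be quantified with group medians (a quick check):

```r
# Median balance for each response group
data %>%
  group_by(y) %>%
  summarise(median_balance = median(balance))
```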
plot9= data %>%
group_by(job, loan) %>%
count() %>%
group_by(loan) %>%
mutate(percentage = n/ sum(n)) %>%
mutate(highlight = ifelse((job %in% c('blue-collar','admin.','services','entrepreneur')),'yes','no')) %>%
ggplot(aes(reorder(job, percentage), percentage, fill = loan))+
geom_col(position = 'dodge', aes(alpha = highlight))+
scale_alpha_manual(values = c('yes' = 1, 'no' = .4), guide = F)+
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
labs(x = '', y = 'Percentage', fill = 'Has loan',
title = 'Percentage of clients with personal loan by job')+
theme_minimal()+
coord_flip()
plot9
Clients with blue-collar, admin, services, and entrepreneur jobs are more likely to have a personal loan.
Note: Observations with balance less than 5000 only
plot10 = data %>%
filter(balance < 5000) %>%
group_by(education) %>%
mutate(median_balance = median(balance)) %>%
ggplot(aes(balance))+
geom_histogram(binwidth = 250, fill = 'steelblue')+
geom_vline(aes(xintercept = median_balance), color = 'black', linetype = 'dashed')+
theme_minimal()+
facet_grid(education ~ .)+
labs(title = 'Distribution of balance by education', y = '', x = 'balance', caption = 'Black dashed line is median')
ggplotly(plot10)
The distributions have almost the same shape (right-tailed), with the tertiary education level having the highest median and maximum. Primary and secondary education levels also show more occurrences of negative balance.
Note: balance > 0 only; around 700 observations removed
plot11 = data %>%
filter(balance > 0) %>%
ggplot(aes(age, balance))+
geom_point(alpha = .7, aes(color = y))+
scale_y_log10(label = comma)+
geom_smooth(color = 'black', size = 1.2, alpha = .8)+
labs(x = 'age', y = 'Log balance', title = 'Age vs Log balance', color = 'Term deposit')+
scale_color_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
theme_minimal()
ggplotly(plot11)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is a slightly increasing trend, which is more noticeable in the higher age groups. Also, most of the points at the bottom did not avail of a term deposit.
plot12 = data %>%
ggplot(aes(x = as.factor(y), y = duration, fill = y))+
geom_violin(width = 1.4, alpha = .7)+
stat_summary(fun = median, geom = 'point', shape = 22 , size = 2)+
scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
labs(x = 'W/ term deposit', y= 'contact duration', title = 'Contact duration distribution by term deposit',
caption = 'Violin plot with median (square)')+
theme_minimal()+
theme(legend.position = 'none')
ggplotly(plot12)
From the data, those who get a term deposit have a higher contact duration.
Change character columns into factors:
data = data %>%
mutate_if(is.character, as.factor)
glimpse(data)
## Rows: 4,521
## Columns: 17
## $ age <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 3...
## $ job <fct> unemployed, services, management, management, blue-collar...
## $ marital <fct> married, married, single, married, married, single, marri...
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertia...
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ balance <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374,...
## $ housing <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes,...
## $ loan <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no...
## $ contact <fct> cellular, cellular, cellular, unknown, unknown, cellular,...
## $ day <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, ...
## $ month <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, ap...
## $ duration <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113,...
## $ campaign <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, ...
## $ pdays <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, ...
## $ previous <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, ...
## $ poutcome <fct> unknown, failure, failure, unknown, unknown, failure, oth...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, ...
Identify variables with near zero variance
nz = nearZeroVar(data)
colnames(data[,nz])
## [1] "default" "pdays"
We can remove the variables default and pdays since they don’t have strong predictive power. Also, pdays is highly correlated with previous.
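The metrics behind the flag can be inspected with caret's `saveMetrics` argument (a sketch, reusing the `nz` indices computed above):

```r
# freqRatio and percentUnique are the metrics that drive the
# near-zero-variance flag for each column
nearZeroVar(data, saveMetrics = TRUE)[nz, ]
```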
Remove these columns and the balance outlier:
data = data %>%
  filter(balance != max(balance))
data = data[, -nz]
Use 80% for training and 20% for testing. Stratify the split on the response variable y so that the train and test sets have the same proportion of y.
in_train = createDataPartition(data$y, p = .8, list = F)
train_data = data[in_train,]
test_data = data[-in_train,]
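A quick check that the stratified split preserved the class proportions:

```r
# Both splits should mirror the full-data proportions (~88.5% no / ~11.5% yes)
table(train_data$y) / nrow(train_data)
table(test_data$y) / nrow(test_data)
```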
Create downsampled and upsampled versions of the training set to balance the response variable:
#834 rows
train_data_down = downSample(x = train_data %>%
select(-y),
y = train_data$y,
yname = 'y')
#6400 rows
train_data_up = upSample(x = train_data %>%
select(-y),
y = train_data$y,
yname = 'y')
table(train_data_down$y)/nrow(train_data_down)
##
## no yes
## 0.5 0.5
Set up train control for resampling.
train_control = trainControl(method = 'cv', number = 10, classProbs =T, summaryFunction = multiClassSummary)
Set up parallel computing:
library(doParallel)
cores = detectCores()
cl = makeCluster(cores[1]-1)
#Register cluster
registerDoParallel(cl)
set.seed(100)
glm_model = train(y ~., data = train_data_down, method = 'glm', family = 'binomial', trControl = train_control)
glm_model
## Generalized Linear Model
##
## 834 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ...
## Resampling results:
##
## logLoss AUC prAUC Accuracy Kappa F1 Sensitivity
## 0.4740668 0.8720093 0.8400509 0.798537 0.5969431 0.8041056 0.8247387
## Specificity Pos_Pred_Value Neg_Pred_Value Precision Recall
## 0.7720674 0.7870392 0.8160755 0.7870392 0.8247387
## Detection_Rate Balanced_Accuracy
## 0.4124068 0.798403
Predict on the test set using the GLM model:
glm_preds = predict(glm_model, test_data)
confusionMatrix(glm_preds, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 664 21
## yes 135 83
##
## Accuracy : 0.8272
## 95% CI : (0.801, 0.8514)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.426
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79808
## Specificity : 0.83104
## Pos Pred Value : 0.38073
## Neg Pred Value : 0.96934
## Prevalence : 0.11517
## Detection Rate : 0.09192
## Detection Prevalence : 0.24142
## Balanced Accuracy : 0.81456
##
## 'Positive' Class : yes
##
This model has high accuracy but a low positive predictive value: far more clients were predicted yes than were actually yes. In real life, this would incur additional cost because we would market to clients who will not subscribe. If there is budget for it, this might be acceptable.
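To make that trade-off concrete, one could attach a hypothetical cost to each contact and a revenue to each converted client. The figures below are made up purely for illustration, not taken from the dataset:

```r
# Hypothetical unit economics (illustrative values only)
cost_per_contact = 5    # cost of targeting one predicted-yes client
revenue_per_yes = 50    # value of a client who actually subscribes

cm = table(Prediction = glm_preds, Reference = test_data$y)
contacted = cm['yes', 'no'] + cm['yes', 'yes']  # all predicted yes
converted = cm['yes', 'yes']                    # true positives
converted * revenue_per_yes - contacted * cost_per_contact
```

With numbers like these, false positives eat into the campaign's net value, which is the cost referred to above.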
Train the same model, but with the upsampled training data:
set.seed(100)
glm_model_up = train(y ~., data = train_data_up, method = 'glm', family = 'binomial', trControl = train_control)
glm_model_up
## Generalized Linear Model
##
## 6400 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ...
## Resampling results:
##
## logLoss AUC prAUC Accuracy Kappa F1 Sensitivity
## 0.4041624 0.9077314 0.8983478 0.8235938 0.6471875 0.82678 0.841875
## Specificity Pos_Pred_Value Neg_Pred_Value Precision Recall
## 0.8053125 0.8126312 0.8361503 0.8126312 0.841875
## Detection_Rate Balanced_Accuracy
## 0.4209375 0.8235938
glm_model_up_pred = predict(glm_model_up, test_data)
confusionMatrix(glm_model_up_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 662 24
## yes 137 80
##
## Accuracy : 0.8217
## 95% CI : (0.7951, 0.8461)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4059
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.76923
## Specificity : 0.82854
## Pos Pred Value : 0.36866
## Neg Pred Value : 0.96501
## Prevalence : 0.11517
## Detection Rate : 0.08859
## Detection Prevalence : 0.24031
## Balanced Accuracy : 0.79888
##
## 'Positive' Class : yes
##
The model trained on the upsampled data has slightly higher resampled accuracy, but its test accuracy is marginally lower and it still has a low positive predictive value.
One advantage of using a simpler model such as logistic regression is that you can inspect the estimate and significance of each variable.
summary(glm_model_up)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4594 -0.5858 -0.0060 0.5976 2.3582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.871e-02 3.451e-01 -0.286 0.774859
## age -1.193e-02 4.492e-03 -2.657 0.007889 **
## `jobblue-collar` -8.191e-01 1.442e-01 -5.681 1.34e-08 ***
## jobentrepreneur -5.542e-01 2.283e-01 -2.427 0.015214 *
## jobhousemaid -3.224e-01 2.535e-01 -1.272 0.203445
## jobmanagement -4.005e-01 1.468e-01 -2.728 0.006373 **
## jobretired 6.673e-01 1.955e-01 3.413 0.000642 ***
## `jobself-employed` -4.640e-01 2.186e-01 -2.123 0.033782 *
## jobservices -6.448e-01 1.671e-01 -3.860 0.000114 ***
## jobstudent 4.766e-01 2.552e-01 1.867 0.061883 .
## jobtechnician -4.057e-01 1.360e-01 -2.984 0.002847 **
## jobunemployed -4.347e-01 2.322e-01 -1.872 0.061157 .
## jobunknown 7.504e-01 3.457e-01 2.171 0.029931 *
## maritalmarried -1.252e-01 1.130e-01 -1.108 0.267689
## maritalsingle -5.690e-02 1.328e-01 -0.429 0.668284
## educationsecondary 1.691e-01 1.257e-01 1.345 0.178728
## educationtertiary 3.987e-01 1.427e-01 2.795 0.005196 **
## educationunknown -4.496e-01 2.288e-01 -1.965 0.049398 *
## balance 2.468e-06 1.308e-05 0.189 0.850370
## housingyes -3.929e-01 8.354e-02 -4.703 2.56e-06 ***
## loanyes -9.621e-01 1.192e-01 -8.072 6.90e-16 ***
## contacttelephone 1.178e-01 1.457e-01 0.809 0.418560
## contactunknown -1.330e+00 1.238e-01 -10.742 < 2e-16 ***
## day 1.040e-02 4.914e-03 2.117 0.034273 *
## monthaug -6.591e-01 1.447e-01 -4.555 5.25e-06 ***
## monthdec 3.029e-01 4.817e-01 0.629 0.529527
## monthfeb -3.153e-01 1.848e-01 -1.706 0.087934 .
## monthjan -1.780e+00 2.368e-01 -7.517 5.61e-14 ***
## monthjul -1.149e+00 1.557e-01 -7.378 1.61e-13 ***
## monthjun -2.121e-01 1.808e-01 -1.173 0.240616
## monthmar 1.552e+00 2.729e-01 5.686 1.30e-08 ***
## monthmay -8.038e-01 1.436e-01 -5.597 2.18e-08 ***
## monthnov -9.065e-01 1.582e-01 -5.730 1.00e-08 ***
## monthoct 1.184e+00 2.390e-01 4.956 7.20e-07 ***
## monthsep 1.067e-01 3.319e-01 0.321 0.747899
## duration 6.079e-03 1.751e-04 34.720 < 2e-16 ***
## campaign -1.165e-01 1.813e-02 -6.425 1.32e-10 ***
## previous -5.836e-02 2.867e-02 -2.036 0.041764 *
## poutcomeother 1.056e+00 1.705e-01 6.194 5.86e-10 ***
## poutcomesuccess 2.625e+00 2.223e-01 11.809 < 2e-16 ***
## poutcomeunknown -4.235e-01 1.361e-01 -3.112 0.001856 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8872.3 on 6399 degrees of freedom
## Residual deviance: 5076.6 on 6359 degrees of freedom
## AIC: 5158.6
##
## Number of Fisher Scoring iterations: 6
Among the highly significant variables are duration, loanyes, contactunknown, and poutcomesuccess.
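One way to make those coefficients interpretable is to exponentiate them into odds ratios (a sketch; `finalModel` is the underlying `glm` object that caret stores in the train object):

```r
# Odds ratios: values > 1 raise the odds of a term deposit, < 1 lower them
exp(coef(glm_model_up$finalModel))[c('loanyes', 'housingyes', 'poutcomesuccess')]
```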
Try a more complex model with tuning parameters to increase model accuracy.
set.seed(100)
# set up tuning grid parameter
# try mtry values from 2 to 8
tunegrid = expand.grid(.mtry=c(2:8))
rf_model = train(y ~., data = train_data_up, method = 'rf', tuneGrid = tunegrid, trControl = train_control)
rf_model
## Random Forest
##
## 6400 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ...
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa F1
## 2 0.32940082 0.9648945 0.9605005 0.9073438 0.8146875 0.9054030
## 3 0.21548497 0.9936597 0.9860055 0.9501563 0.9003125 0.9482608
## 4 0.15130195 0.9987207 0.9824974 0.9654687 0.9309375 0.9642947
## 5 0.11636163 0.9995269 0.9568789 0.9709375 0.9418750 0.9700589
## 6 0.09810270 0.9996938 0.9292311 0.9725000 0.9450000 0.9716944
## 7 0.08776087 0.9997627 0.8877371 0.9739062 0.9478125 0.9731940
## 8 0.08230689 0.9997446 0.8427196 0.9756250 0.9512500 0.9750030
## Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Precision
## 0.8871875 0.9275000 0.9249028 0.8919291 0.9249028
## 0.9140625 0.9862500 0.9852632 0.9199784 0.9852632
## 0.9334375 0.9975000 0.9973251 0.9375371 0.9973251
## 0.9428125 0.9990625 0.9990238 0.9459864 0.9990238
## 0.9459375 0.9990625 0.9990260 0.9488463 0.9990260
## 0.9487500 0.9990625 0.9990249 0.9513363 0.9990249
## 0.9521875 0.9990625 0.9990280 0.9544558 0.9990280
## Recall Detection_Rate Balanced_Accuracy
## 0.8871875 0.4435938 0.9073438
## 0.9140625 0.4570312 0.9501563
## 0.9334375 0.4667188 0.9654687
## 0.9428125 0.4714062 0.9709375
## 0.9459375 0.4729687 0.9725000
## 0.9487500 0.4743750 0.9739062
## 0.9521875 0.4760937 0.9756250
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
rf_model_pred = predict(rf_model, test_data)
confusionMatrix(rf_model_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 756 65
## yes 43 39
##
## Accuracy : 0.8804
## 95% CI : (0.8574, 0.9008)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 0.68420
##
## Kappa : 0.3537
##
## Mcnemar's Test P-Value : 0.04331
##
## Sensitivity : 0.37500
## Specificity : 0.94618
## Pos Pred Value : 0.47561
## Neg Pred Value : 0.92083
## Prevalence : 0.11517
## Detection Rate : 0.04319
## Detection Prevalence : 0.09081
## Balanced Accuracy : 0.66059
##
## 'Positive' Class : yes
##
This model has higher accuracy but much lower sensitivity on the test set. It overfitted in predicting yes, probably because the training data is balanced whereas the test data is not. This model, however, has higher specificity and lower log loss, and is better at identifying TRUE NEGATIVE cases.
Use random forest on downsampled data.
set.seed(100)
# set up tuning grid parameter
# try mtry values from 2 to 8
tunegrid = expand.grid(.mtry=c(2:8))
rf_model_down = train(y ~., data = train_data_down, method = 'rf', tuneGrid = tunegrid, trControl = train_control)
rf_model_down
## Random Forest
##
## 834 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ...
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa F1
## 2 0.5050709 0.8596773 0.8315820 0.7829891 0.5660924 0.7900410
## 3 0.4793906 0.8736657 0.8440478 0.8057803 0.6117180 0.8047278
## 4 0.4619833 0.8793575 0.8517905 0.8201377 0.6403791 0.8141260
## 5 0.4540130 0.8779485 0.8491756 0.8225617 0.6451712 0.8150070
## 6 0.4428250 0.8823579 0.8556516 0.8189185 0.6377818 0.8105794
## 7 0.4400921 0.8810941 0.8522443 0.8224900 0.6449387 0.8141937
## 8 0.4388059 0.8796008 0.8507686 0.8201520 0.6402436 0.8127466
## Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Precision
## 0.8129501 0.7533101 0.7705675 0.8005727 0.7705675
## 0.7984901 0.8135889 0.8146338 0.8021661 0.8146338
## 0.7913473 0.8493031 0.8427088 0.8051856 0.8427088
## 0.7840302 0.8613240 0.8531463 0.8015348 0.8531463
## 0.7790360 0.8587689 0.8482237 0.7977085 0.8482237
## 0.7837979 0.8612079 0.8502314 0.8021353 0.8502314
## 0.7838560 0.8563879 0.8473055 0.8006350 0.8473055
## Recall Detection_Rate Balanced_Accuracy
## 0.8129501 0.4064400 0.7831301
## 0.7984901 0.3992255 0.8060395
## 0.7913473 0.3955967 0.8203252
## 0.7840302 0.3919822 0.8226771
## 0.7790360 0.3895726 0.8189024
## 0.7837979 0.3919679 0.8225029
## 0.7838560 0.3919822 0.8201220
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
rf_model_pred_down = predict(rf_model_down, test_data)
confusionMatrix(rf_model_pred_down, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 628 20
## yes 171 84
##
## Accuracy : 0.7885
## 95% CI : (0.7604, 0.8147)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3639
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80769
## Specificity : 0.78598
## Pos Pred Value : 0.32941
## Neg Pred Value : 0.96914
## Prevalence : 0.11517
## Detection Rate : 0.09302
## Detection Prevalence : 0.28239
## Balanced Accuracy : 0.79684
##
## 'Positive' Class : yes
##
Now this model has higher sensitivity but lower overall accuracy. The random forest is less robust than the GLM here and is more affected by the sampling method applied to the training data.
#model setup
set.seed(100)
svm_grid = expand.grid(C = seq(0, 2, length = 10))
# fit model
svm_model = train(y ~., data = train_data_up, method = 'svmLinear', tuneGrid = svm_grid, trControl = train_control)
svm_model
## Support Vector Machines with Linear Kernel
##
## 6400 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ...
## Resampling results across tuning parameters:
##
## C logLoss AUC prAUC Accuracy Kappa F1
## 0.0000000 NaN NaN NaN NaN NaN NaN
## 0.2222222 0.4066066 0.9061504 0.8961309 0.8201563 0.6403125 0.8235164
## 0.4444444 0.4065720 0.9060430 0.8960407 0.8212500 0.6425000 0.8242844
## 0.6666667 0.4064384 0.9060391 0.8959929 0.8212500 0.6425000 0.8241351
## 0.8888889 0.4065123 0.9060342 0.8960005 0.8212500 0.6425000 0.8244095
## 1.1111111 0.4067220 0.9060420 0.8960048 0.8217188 0.6434375 0.8247215
## 1.3333333 0.4066551 0.9060439 0.8960016 0.8207813 0.6415625 0.8238538
## 1.5555556 0.4067011 0.9060254 0.8959933 0.8218750 0.6437500 0.8249997
## 1.7777778 0.4067157 0.9060234 0.8959825 0.8214063 0.6428125 0.8243619
## 2.0000000 0.4067915 0.9059932 0.8959376 0.8218750 0.6437500 0.8247914
## Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Precision
## NaN NaN NaN NaN NaN
## 0.8387500 0.8015625 0.8091409 0.8325960 0.8091409
## 0.8384375 0.8040625 0.8108553 0.8328585 0.8108553
## 0.8375000 0.8050000 0.8114569 0.8321921 0.8114569
## 0.8390625 0.8034375 0.8105103 0.8332504 0.8105103
## 0.8387500 0.8046875 0.8113994 0.8332242 0.8113994
## 0.8378125 0.8037500 0.8106235 0.8321405 0.8106235
## 0.8396875 0.8040625 0.8110855 0.8339685 0.8110855
## 0.8381250 0.8046875 0.8113334 0.8327172 0.8113334
## 0.8384375 0.8053125 0.8118613 0.8330976 0.8118613
## Recall Detection_Rate Balanced_Accuracy
## NaN NaN NaN
## 0.8387500 0.4193750 0.8201563
## 0.8384375 0.4192187 0.8212500
## 0.8375000 0.4187500 0.8212500
## 0.8390625 0.4195313 0.8212500
## 0.8387500 0.4193750 0.8217188
## 0.8378125 0.4189063 0.8207813
## 0.8396875 0.4198438 0.8218750
## 0.8381250 0.4190625 0.8214063
## 0.8384375 0.4192188 0.8218750
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 1.555556.
svm_model_pred = predict(svm_model, test_data)
confusionMatrix(svm_model_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 659 23
## yes 140 81
##
## Accuracy : 0.8195
## 95% CI : (0.7928, 0.8441)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4053
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7788
## Specificity : 0.8248
## Pos Pred Value : 0.3665
## Neg Pred Value : 0.9663
## Prevalence : 0.1152
## Detection Rate : 0.0897
## Detection Prevalence : 0.2447
## Balanced Accuracy : 0.8018
##
## 'Positive' Class : yes
##
#model setup
set.seed(100)
svm_grid = expand.grid(C = seq(0, 2, length = 10))
# fit model
svm_model_down = train(y ~., data = train_data_down, method = 'svmLinear', tuneGrid = svm_grid, trControl = train_control)
svm_model_down
## Support Vector Machines with Linear Kernel
##
## 834 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ...
## Resampling results across tuning parameters:
##
## C logLoss AUC prAUC Accuracy Kappa F1
## 0.0000000 NaN NaN NaN NaN NaN NaN
## 0.2222222 0.4619795 0.8774985 0.8467440 0.8093087 0.6185298 0.8162311
## 0.4444444 0.4636142 0.8776036 0.8466057 0.8056942 0.6112669 0.8114941
## 0.6666667 0.4651613 0.8776755 0.8466338 0.8056942 0.6112613 0.8119909
## 0.8888889 0.4617902 0.8772690 0.8459624 0.8020941 0.6040870 0.8082144
## 1.1111111 0.4649215 0.8775040 0.8461614 0.8044750 0.6088531 0.8103528
## 1.3333333 0.4656764 0.8772136 0.8459026 0.8056655 0.6112298 0.8109730
## 1.5555556 0.4634754 0.8771583 0.8458640 0.8056799 0.6112441 0.8119140
## 1.7777778 0.4641748 0.8771597 0.8459145 0.8044607 0.6088093 0.8103678
## 2.0000000 0.4632163 0.8766965 0.8454792 0.8056512 0.6111903 0.8113126
## Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Precision
## NaN NaN NaN NaN NaN
## 0.8463415 0.7721254 0.7901163 0.8354428 0.7901163
## 0.8366434 0.7745064 0.7895366 0.8273075 0.7895366
## 0.8390825 0.7720674 0.7882624 0.8288409 0.7882624
## 0.8343206 0.7696864 0.7852680 0.8242163 0.7852680
## 0.8367015 0.7720674 0.7872328 0.8272494 0.7872328
## 0.8343206 0.7768293 0.7902680 0.8254984 0.7902680
## 0.8390825 0.7720674 0.7879856 0.8288531 0.7879856
## 0.8367015 0.7720093 0.7871557 0.8270850 0.7871557
## 0.8367015 0.7743902 0.7888671 0.8274406 0.7888671
## Recall Detection_Rate Balanced_Accuracy
## NaN NaN NaN
## 0.8463415 0.4231784 0.8092334
## 0.8366434 0.4183735 0.8055749
## 0.8390825 0.4195783 0.8055749
## 0.8343206 0.4171830 0.8020035
## 0.8367015 0.4183735 0.8043844
## 0.8343206 0.4171830 0.8055749
## 0.8390825 0.4195783 0.8055749
## 0.8367015 0.4183735 0.8043554
## 0.8367015 0.4183735 0.8055459
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.2222222.
svm_model_pred_down = predict(svm_model_down, test_data)
confusionMatrix(svm_model_pred_down, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 667 21
## yes 132 83
##
## Accuracy : 0.8306
## 95% CI : (0.8045, 0.8545)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4322
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79808
## Specificity : 0.83479
## Pos Pred Value : 0.38605
## Neg Pred Value : 0.96948
## Prevalence : 0.11517
## Detection Rate : 0.09192
## Detection Prevalence : 0.23810
## Balanced Accuracy : 0.81644
##
## 'Positive' Class : yes
##
The SVM trained on the downsampled data has statistics similar to the SVM trained on the upsampled data. The SVM results are also very close to the metrics from the GLM.
Models from upsampled data
models_upsampled = list(glm_up = glm_model_up, rf_up = rf_model, svm_up = svm_model)
resampled_upsampled =resamples(models_upsampled)
bwplot(resampled_upsampled, metric = c('Sensitivity','Specificity','Accuracy','AUC'), main = 'Resampled metrics on upsampled training data')
From the resampling results of the different models, the random forest clearly has the highest accuracy metrics, and its confidence intervals do not overlap with the others. However, we know from the confusion matrix on the test data that the random forest overfitted: it has low sensitivity and was not able to predict the response value “YES” well.
Models from downsampled data
models_downsampled = list(glm_down = glm_model, rf_down = rf_model_down, svm_down = svm_model_down)
resampled_downsampled =resamples(models_downsampled)
bwplot(resampled_downsampled, metric = c('Accuracy','AUC','Sensitivity','Specificity'), main = 'Resampled metrics on downsampled training data')
For the downsampled training data, the models have similar metric values with overlapping confidence intervals.
compare_models(glm_model, svm_model_down)
##
## One Sample t-test
##
## data: x
## t = -1.875, df = 9, p-value = 0.09355
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.023767428 0.002224112
## sample estimates:
## mean of x
## -0.01077166
There is no significant difference in the resampling results of the GLM and SVM models (p = 0.094). Therefore, it would be better to pick the GLM since it is more explainable and scalable than the SVM.
When the resampling results from the random forest were compared with those of the other two models, RF clearly had higher accuracy metrics. The problem with RF lies more in the data it was trained on: despite its high resampled accuracy, it was not able to correctly predict TRUE POSITIVES on the test set.
One would choose the GLM over the RF for its simplicity and acceptable accuracy, even though it predicts more “YES” cases. The downside of the GLM would be the cost of the marketing campaign spent on FALSE POSITIVE clients.
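That false-positive cost could also be reduced by raising the classification cutoff instead of using the default 0.5. A sketch using the class probabilities caret can return (the 0.7 cutoff is an arbitrary illustration; in practice it would be tuned on a validation set):

```r
# Class probabilities from the GLM trained on the downsampled data
probs = predict(glm_model, test_data, type = 'prob')$yes

# A stricter cutoff than 0.5 trades recall for precision,
# shrinking the number of false-positive contacts
preds_strict = factor(ifelse(probs > 0.7, 'yes', 'no'),
                      levels = levels(test_data$y))
confusionMatrix(preds_strict, test_data$y, positive = 'yes')
```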
plot(varImp(glm_model_up), top = 10, main = 'Top 10 important variables in the GLM')
plot(varImp(rf_model), top = 10, main = 'Top 10 important variables in the Random Forest Model')
In both models, duration and poutcome appear among the top 10 most important variables.